Breast Cancer Detection Project Report

INTRODUCTION:

    Breast cancer is a group of diseases in which cells in the breast tissue change and divide uncontrollably, typically resulting in a lump or mass. Most breast cancers begin in the lobules or in the ducts that connect the lobules to the nipple. Breast cancer typically has no symptoms when the tumor is small and most easily treated, which is why screening is important for early detection.
    Breast cancer is typically detected either during screening, before symptoms have developed, or after a woman notices a lump. Most masses seen on a mammogram and most breast lumps turn out to be benign. According to the WHO (World Health Organization), in 2020 there were 2.3 million women diagnosed with breast cancer and 685,000 deaths globally. As of the end of 2020, there were 7.8 million women alive who had been diagnosed with breast cancer in the past 5 years, making it the world's most prevalent cancer.
    Based on current incidence rates, 12.9% of women born in the United States today will develop breast cancer at some time during their lives. In Indian women, breast cancer accounts for 14% of cancers. It is reported that every four minutes an Indian woman is diagnosed with breast cancer. Breast cancer is on the rise in both rural and urban India.
    Doctors cannot identify each and every breast cancer patient. This is where Machine Learning Engineers and Data Scientists come into the picture, because they have the mathematical knowledge and computational power to assist with detection.
Some Risk Factors for Breast Cancer:
    The following are some of the known risk factors for breast cancer. However, most cases of breast cancer cannot be linked to a specific cause.

  1. Women who menstruate for the first time at an early age (before 12).
  2. Women who go through menopause late (after age 55).
  3. Women who've never had children.

Goal:

    Features have been extracted from the cells of breast cancer patients and of healthy people. The task of the Machine Learning Engineer/Data Scientist is to create a Machine Learning model that classifies a tumor as "malignant (cancerous)" or "benign (not cancerous)". Because the given dataset is labelled, this is a Supervised Learning problem of Binary Classification. To achieve this, I have used machine learning classification methods to fit a function that can predict the discrete class of a new input.

Data Description:

Type:

    The breast cancer dataset used in this project is labelled data, meaning that the input data [feature columns] is already tagged with the correct output data [target column]. So, the given dataset is a Supervised Learning dataset.
    In supervised learning, the training data provided to the machine works as a supervisor that teaches the machine to predict the output correctly. Supervised learning is the process of providing input data together with the correct output data to the machine learning model.

Columns:

    The given dataset has both feature columns and a target column.
Columns in the dataset :-
    'Unnamed: 0', 'id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst'.

Useful Columns:

    In the given dataset, only some columns are useful for building the machine learning model. The useful columns are divided into two types: feature columns and the target column.
Feature columns :-
    As feature columns we used all the columns except the unwanted columns ['Unnamed: 0', 'id'] and the target column ['diagnosis'].
Target Column :-
    The 'diagnosis' column is used as the target column. It indicates whether a patient's cancer is Malignant or Benign.

Problem Statement:

    Breast cancer is one of the leading cancers in many countries, including India. The survival rate is high: with early diagnosis, 97% of women survive for more than 5 years. Even so, the death toll due to this disease has increased drastically in the last few decades. The main issue pertaining to its cure is early recognition. Hence, apart from medicinal solutions, a Data Science solution needs to be integrated to help address this deadly disease.
    This analysis aims to observe which features are most helpful in predicting malignant or benign cancer and to see general trends that may aid us in model selection and hyperparameter selection. The goal is to classify whether the breast cancer is benign or malignant. To achieve this, I have used machine learning classification methods to fit a function that can predict the discrete class of a new input.

Approach:

    In this project we will use Machine Learning algorithms to detect breast cancer based on data. We have the data needed to build the Machine Learning model. First we will apply Exploratory Data Analysis and Data Preprocessing techniques to the loaded Breast Cancer Dataset.
    After preprocessing we have a clean dataset to build the machine learning model. But which Machine Learning algorithms should we use? As discussed in the sections above, our dataset is a Supervised Learning dataset and our target column has two classes, which means it is a Binary Classification problem. So, we use Supervised Learning classification algorithms in our project.
    Finally, after the Machine Learning model has been built, it needs to be deployed in an application. To deploy the model, we first need to save it using the pickle package.

Proposed Solution And Result Analysis:

Import Essential Libraries:
    We import the 'numpy' library for numeric calculation, the 'pandas' library for data manipulation and analysis, and the 'matplotlib' and 'seaborn' libraries for data visualization. We import the 'sklearn' libraries for training and testing the data and for building the Machine Learning model. We import the 'pickle' library to save the Machine Learning model.

Loading Dataset:
    We have already downloaded the breast cancer data to our device, so we load the dataset using pandas 'read_csv()'.
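
    As a minimal sketch of this loading step, the snippet below reads a tiny in-memory excerpt instead of the real file, since the file's path is not given in this report. The excerpt text and the name 'cancer_raw' are illustrative assumptions; only a few of the 33 columns are shown.

```python
import io

import pandas as pd

# Illustrative two-row excerpt standing in for the downloaded CSV file
# (assumption: the real file has the same header style, with 33 columns).
csv_text = """Unnamed: 0,id,diagnosis,radius_mean,texture_mean
0,842302,M,17.99,10.38
1,8510426,B,13.54,14.36
"""

# With the real file, the argument would be the path to the CSV on disk.
cancer_raw = pd.read_csv(io.StringIO(csv_text))

print(cancer_raw.shape)
print(cancer_raw.columns.tolist())
```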

Data Exploration:

    We explore the dataset using several methods and functions to find the number of columns, the column names, and so on.

    The dataset has 33 columns of data for 569 patients, including the patient ID and the diagnosis of each patient, along with another 30 feature columns.

    The diagnosis column is our target. It records which patients have cancer and which do not. The data in the diagnosis column is categorical, i.e. 'M' means 'malignant' and 'B' means 'benign'.

    Since the data in the Target_Variables is categorical, we need to convert it into numerical data so that we can get accurate results while building the Machine Learning model.

    There are two unique values in the Target_Variables: 'M' and 'B'. We convert the categorical data into numerical data by assigning 'M = 0' and 'B = 1'.

    Now we have numerical data in the Target, which we can easily use to build the Machine Learning model.
    Next, we drop the unwanted columns and the diagnosis column, and assign the remaining columns to 'Feature_Variables'.
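
    The encoding and column-dropping steps above can be sketched as follows. The three-row DataFrame and its values are made up for illustration, and 'map' is one possible way to do the encoding; the report does not say which function was actually used.

```python
import pandas as pd

# Small made-up frame with the same column layout as the report's dataset.
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "id": [101, 102, 103],            # illustrative IDs
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, 12.45],
    "texture_mean": [10.38, 14.36, 15.70],
})

# Encode the target as described in the report: M -> 0, B -> 1.
target = raw["diagnosis"].map({"M": 0, "B": 1})

# Drop the unwanted columns and the target column to obtain the features.
feature_variables = raw.drop(columns=["Unnamed: 0", "id", "diagnosis"])

print(target.tolist())
print(feature_variables.columns.tolist())
```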

Creating a DataFrame:

    We create a Cancer DataFrame without losing our original dataset by concatenating 'Feature_Variables' and 'Target'. The DataFrame is built with pandas 'pd.DataFrame()', and the Feature Variables and Target are concatenated with numpy 'np.c_[x, y]'.
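
    A minimal sketch of this concatenation, using two made-up rows. The column name 'target' matches the hue used later for the pair plot; the other names and values are illustrative.

```python
import numpy as np
import pandas as pd

feature_variables = pd.DataFrame({
    "radius_mean": [17.99, 13.54],
    "texture_mean": [10.38, 14.36],
})
target = pd.Series([0, 1], name="target")

# np.c_ stacks its arguments column-wise into a single 2-D array.
combined = np.c_[feature_variables.to_numpy(), target.to_numpy()]

cancer_df = pd.DataFrame(
    combined, columns=list(feature_variables.columns) + ["target"]
)
print(cancer_df)
```

    Because the stacked array has a single dtype, the integer target values are stored as floats in the resulting DataFrame.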

    Next, we check whether there are any null values in the Cancer DataFrame.

    The check shows that there are no null values in the Cancer DataFrame; the per-column null counts are reported with 'dtype: int64'.

    We then compute the count, mean, standard deviation, minimum, maximum, and the 25%, 50% and 75% values of the Cancer DataFrame.

    The count for every column is 569, which confirms that there are no missing values in the Cancer DataFrame.
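
    These two checks can be sketched as below. As a stand-in for the report's CSV, the sketch uses the copy of the Wisconsin breast cancer data bundled with scikit-learn; that substitution is an assumption, and its column names differ (for example 'mean radius' rather than 'radius_mean').

```python
from sklearn.datasets import load_breast_cancer

# Stand-in data (assumption): 30 feature columns plus a 'target' column.
cancer_df = load_breast_cancer(as_frame=True).frame

# Number of null values per column.
null_counts = cancer_df.isnull().sum()
print(null_counts.sum())

# count, mean, std, min, 25%, 50%, 75% and max for every column.
summary = cancer_df.describe()
print(summary.loc["count"].unique())
```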

Data Visualization:

Pairplot of the Cancer DataFrame:
    The pair plot shows the distribution of the numeric features as a grid of scatter plots.

    For the pair plot we set the hue to "target" to see how the features of Cancer_df are distributed for each target value.

    The pair plot shows the malignant and benign tumor data distributed in two classes, which are easy to tell apart in the plot.

Countplot of the Target:
    The count plot shows the total number of malignant and benign tumor patients.

    In the count plot, the class with the most samples is 1, which means the non-cancerous (benign) patients are in the majority.

    In the count plot of the mean radius, the maximum number of samples is equal to 1.
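
    A count plot of the target could be drawn roughly as follows. The sketch again assumes scikit-learn's bundled copy of the dataset as a stand-in, which uses the same 0 = malignant, 1 = benign encoding as this report.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer

cancer_df = load_breast_cancer(as_frame=True).frame

# The numbers behind the bars: samples per class.
class_counts = cancer_df["target"].value_counts()
print(class_counts)

ax = sns.countplot(x="target", data=cancer_df)
ax.set_xlabel("target (0 = malignant, 1 = benign)")
plt.tight_layout()
```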

Heatmap of Cancer DataFrame:

    In the above heatmap, the values of area_mean and area_worst are greater than those of the other features. The values of perimeter_mean, area_se and perimeter_worst are slightly greater than those of the remaining features.

Heatmap of Correlation matrix:
    To find the correlation between each feature and the target, we visualize the correlation matrix as a heatmap.
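
    A sketch of the correlation heatmap, assuming scikit-learn's bundled copy of the dataset as a stand-in for the report's CSV. The figure size and colour map are arbitrary choices.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer

cancer_df = load_breast_cancer(as_frame=True).frame

# Pairwise Pearson correlation between all columns, including the target.
corr = cancer_df.corr()

fig, ax = plt.subplots(figsize=(14, 12))
sns.heatmap(corr, cmap="coolwarm", ax=ax)
fig.tight_layout()
```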

Boxplot of Cancer DataFrame:
    We use a boxplot to find the outliers in the breast cancer DataFrame.

    In the above boxplot, most of the outliers appear in area_worst and area_mean, and there are also some outliers in area_se, perimeter_mean and perimeter_worst.
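
    By default, a boxplot draws as outliers the points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below counts such points per feature, assuming scikit-learn's bundled copy of the dataset as a stand-in for the report's CSV.

```python
from sklearn.datasets import load_breast_cancer

features = load_breast_cancer(as_frame=True).data

q1 = features.quantile(0.25)
q3 = features.quantile(0.75)
iqr = q3 - q1

# True where a value lies beyond the default boxplot whisker limits.
outside = (features < q1 - 1.5 * iqr) | (features > q3 + 1.5 * iqr)

outlier_counts = outside.sum().sort_values(ascending=False)
print(outlier_counts.head())
```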

Correlation Barplot:
    We take the correlation of each feature with the target and visualize it as a barplot. For that, we create another Cancer DataFrame by dropping the target column.

    In the above correlation barplot, 'smoothness_se' is more strongly positively correlated with the target than the other features. The features 'fractal_dimension_mean', 'texture_se' and 'symmetry_se' are weakly positively correlated with the target, and the others are strongly negatively correlated with the target.
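
    One way to compute and plot these correlations is pandas 'corrwith', sketched below on scikit-learn's bundled copy of the dataset (an assumed stand-in, with column names such as 'smoothness error' instead of 'smoothness_se').

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
features, target = data.data, data.target  # features without the target column

# Pearson correlation of every feature column with the target.
corr_with_target = features.corrwith(target).sort_values()

ax = corr_with_target.plot.bar(figsize=(12, 4))
ax.set_ylabel("correlation with target")
plt.tight_layout()
```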

Data Preprocessing:

Split Cancer Dataframe in train and test:

    For the train and test split, we divide the DataFrame into the input variable (X) and the output variable (y).

    We split the Cancer DataFrame into train data with a fraction of '0.8' and test data with a fraction of '0.2', using a random state of '2'.
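
    The split described above can be sketched with scikit-learn's 'train_test_split'. The data comes from scikit-learn's bundled copy of the dataset, used here as an assumed stand-in for the report's Cancer DataFrame.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# X holds the 30 feature columns, y the encoded target.
X, y = load_breast_cancer(return_X_y=True)

# 80% train, 20% test, with the random state used in the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)
print(X_train.shape, X_test.shape)
```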

Feature Scaling:
    We apply feature scaling to the Cancer DataFrame to bring features with different units and magnitudes onto a common scale.
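
    A sketch of standard scaling with scikit-learn's 'StandardScaler'. Fitting the scaler on the training split only is an assumption about the procedure, and the data is scikit-learn's bundled copy of the dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

scaler = StandardScaler()
# Learn each feature's mean and standard deviation from the training data,
# then apply the same transformation to the test data.
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

print(X_train_sc.mean(axis=0).round(6)[:3])
print(X_train_sc.std(axis=0).round(6)[:3])
```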

Breast Cancer Detection Machine Learning Model Building:

    We now have clean data to build the Machine Learning model. Since our output variable is categorical, we use supervised classification machine learning algorithms. To build the best model, we train and test the dataset with multiple machine learning algorithms and then choose the best one.

ACCURACY SCORE OF THE DATASET WITH MULTIPLE MODELS:-

Support Vector Classifier:

    Using the 'Support Vector Classifier Algorithm' we get the following results:

Logistic Regression Classifier:

    Using the 'Logistic Regression Classifier Algorithm' we get the following results:

K-Nearest Neighbour Classifier:

    Using the 'K-Nearest Neighbour Classifier Algorithm' we get the following results:

Naive Bayes Classifier:

    Using the 'Naive Bayes Classifier Algorithm' we get the following results:

Decision Tree Classifier:

    Using the 'Decision Tree Classifier Algorithm' we get the following results:

Random Forest Classifier:

    Using the 'Random Forest Classifier Algorithm' we get the following results:

AdaBoost Classifier:

    Using the 'AdaBoost Classifier Algorithm' we get the following results:

Stochastic Gradient Descent Classifier:

    Using the 'Stochastic Gradient Descent Classifier Algorithm' we get the following results:

Multi-Layer Perceptron Classifier:

    Using the 'Multi-Layer Perceptron Classifier Algorithm' we get the following results:

    Comparing the accuracy scores of all the Machine Learning models, 'Logistic Regression' gives the highest accuracy score on the standard-scaled data. Although the 'SGD Model' sometimes gives the highest accuracy score, its score changes from run to run with the same parameters. So, we choose the 'Logistic Regression Classifier' to build the machine learning model.
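
    The comparison can be sketched as a loop over the nine classifiers. The hyperparameters, the choice of 'GaussianNB' for Naive Bayes and the use of scikit-learn's bundled copy of the dataset are all assumptions, so the scores will not necessarily match those reported here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

# Mostly default settings (assumption); seeds are fixed for repeatability.
models = {
    "Support Vector Classifier": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbour": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Stochastic Gradient Descent": SGDClassifier(random_state=0),
    "Multi-Layer Perceptron": MLPClassifier(max_iter=1000, random_state=0),
}

# Train every model on the scaled training data and score it on the test data.
scores = {}
for name, model in models.items():
    model.fit(X_train_sc, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test_sc))

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.4f}")
```

    Without a fixed 'random_state', 'SGDClassifier' shuffles the training data differently on each run, which is one possible reason for the changing SGD scores noted above.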

CONFUSION MATRIX:

    This model gives a 0% type II error rate (no false negatives), which is the best possible result.
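
    A sketch of how the confusion matrix and the type II error rate could be computed. It assumes that malignant (0) is treated as the "positive" class, that scaling and Logistic Regression are combined in a pipeline, and that scikit-learn's bundled copy of the dataset stands in for the report's data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)

# Assumption: malignant (0) is the positive class, so a type II error
# (false negative) is a malignant tumor predicted as benign: row 0, column 1.
missed_malignant = cm[0, 1]
type_ii_rate = missed_malignant / cm[0].sum()
print(f"type II error rate: {type_ii_rate:.2%}")
```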

CLASSIFICATION REPORT OF MODEL:
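
    A classification report (precision, recall and F1 score per class) could be produced as sketched below. The pipeline, the label names and the use of scikit-learn's bundled copy of the dataset are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# target_names follow the label order 0, 1 used in this report.
report = classification_report(
    y_test, y_pred, target_names=["malignant (0)", "benign (1)"]
)
print(report)
```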

CROSS-VALIDATION OF THE MACHINE LEARNING MODEL:

    We perform cross-validation to find out whether the Machine Learning model is overfitted, underfitted or generalized.

    The mean accuracy of cross-validation is 97.80% and the Logistic Regression model accuracy is 97.36%. This shows that the Logistic Regression model is slightly underfitted, but with more training data it should become a generalized model.
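
    The comparison between cross-validation accuracy and test accuracy can be sketched as follows. The number of folds ('cv=10'), the pipeline and the use of scikit-learn's bundled copy of the dataset are assumptions, so the printed values may differ from the figures above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Accuracy on each of 10 folds of the training data (fold count is an assumption).
cv_scores = cross_val_score(model, X_train, y_train, cv=10)

# Accuracy of the model trained on the full training split, on the held-out test split.
test_accuracy = model.fit(X_train, y_train).score(X_test, y_test)

print(f"mean CV accuracy: {cv_scores.mean():.4f}")
print(f"test accuracy:    {test_accuracy:.4f}")
```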

Save the Machine Learning Model:

    After the Machine Learning model has been built, it needs to be deployed in an application. To deploy the model, we first need to save it. To save the Machine Learning model we use the 'pickle' package.

    We have completed building and saving the Machine Learning model with a 97.36% accuracy rate.
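
    Saving and reloading a trained model with pickle can be sketched as below. The file name 'breast_cancer_model.pickle' and the temporary directory are hypothetical, and the model is trained on scikit-learn's bundled copy of the dataset.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# Hypothetical output location for the saved model.
model_path = os.path.join(tempfile.gettempdir(), "breast_cancer_model.pickle")

# Serialize the fitted model to disk.
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back, as a deployed application would.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(same_predictions)
```

    Pickling the scaler together with the classifier in one pipeline means the application does not have to repeat the scaling step by hand.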

Conclusion:

    We trained a range of supervised classification algorithms to get the best accuracy in the results. After training all the algorithms, we found that 'Support Vector Machine', 'Logistic Regression', 'Stochastic Gradient Descent' and 'Multi-Layer Perceptron' give the highest accuracy, but we chose 'Logistic Regression' as the algorithm for this Machine Learning project. It gave a 97.36% accuracy rate and was slightly underfitted compared with the 97.80% obtained under cross-validation.

References: